A Hybrid Approach for Building Arabic Diacritizer

Authors

  • Khaled Shaalan
  • Hitham Mohamed Abo Bakr
  • Ibrahim Ziedan
Abstract

Modern Standard Arabic is usually written without diacritics, which makes Arabic text processing difficult. Diacritization helps clarify the meaning of words and disambiguates vague spellings or pronunciations, since some Arabic words are spelled the same but differ in meaning. In this paper, we address the problem of adding diacritics to undiacritized Arabic text using a hybrid approach. The approach requires an Arabic lexicon and a large corpus of fully diacritized text for training in order to detect diacritics. Case ending is treated as a separate post-processing task that uses syntactic information. The hybrid approach relies on lexicon retrieval, bigram, and SVM-statistical prioritized techniques. We present the results of an evaluation of the proposed diacritization approach and discuss various modifications for improving its performance.
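The prioritized combination named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the priority order (unambiguous lexicon entry, then bigram context, then a statistical fallback), the function names `train` and `diacritize`, the `fallback` callable standing in for the SVM component, and the Latin transliterations used as toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(diacritized_sentences):
    """Build lookup tables from sentences given as lists of (bare, diacritized) pairs."""
    lexicon = defaultdict(Counter)  # bare word -> counts of its diacritized forms
    bigrams = defaultdict(Counter)  # (previous diacritized word, bare word) -> counts
    for sentence in diacritized_sentences:
        prev = "<s>"  # sentence-start marker
        for bare, full in sentence:
            lexicon[bare][full] += 1
            bigrams[(prev, bare)][full] += 1
            prev = full
    return lexicon, bigrams

def diacritize(words, lexicon, bigrams, fallback=lambda w: w):
    """Try each technique in priority order; `fallback` is a hypothetical
    stand-in for a trained statistical classifier (e.g. an SVM)."""
    out = []
    prev = "<s>"
    for bare in words:
        forms = lexicon.get(bare)
        if forms and len(forms) == 1:
            full = next(iter(forms))  # 1. unambiguous lexicon entry
        elif (prev, bare) in bigrams:
            full = bigrams[(prev, bare)].most_common(1)[0][0]  # 2. bigram context
        else:
            full = fallback(bare)  # 3. statistical fallback (assumed)
        out.append(full)
        prev = full
    return out

# Toy corpus in Latin transliteration (illustrative only, not real training data).
corpus = [
    [("hw", "huwa"), ("ktb", "kataba")],
    [("hdhh", "hadhihi"), ("ktb", "kutub")],
]
lexicon, bigrams = train(corpus)
print(diacritize(["hw", "ktb"], lexicon, bigrams))
print(diacritize(["hdhh", "ktb"], lexicon, bigrams))
```

In this toy setup the bare form `ktb` has two diacritized candidates, so the lexicon alone cannot decide and the preceding word selects the form. Case endings are left out, in line with the abstract's treatment of them as a separate post-processing step.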

Similar resources

Arabic Diacritization: Stats, Rules, and Hacks

In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with mo...

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...

Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...

Fuzzy Method in Feasibility Study of Using Biomass Solar Hybrid Source and Photovoltaic System for Designing a Research Center Building in Smart Grid of Hamedan

In this paper, the electrical consumption of a Research Center Building is derived from its bills. Then, by defining a fuzzy Bus Thermal Coefficient objective function for heating, the amount of available heat produced by the Biomass Solar Hybrid Source and sold to the Research Center Building consumer is calculated. With the capacity determination of Biomass Solar Hybrid Source to provide Therm...

The Reality of Arabic Fiction Translation into English: A Sociological Approach

English translations of texts associated with Arabic fiction remain largely unexplored from a sociological perspective. Drawing on Pierre Bourdieu’s sociology, this paper aims to examine the genesis of Arabic fiction translation into English as a socially situated activity. Works of Arabic fiction emerged in English translation in the early twentieth century. Since then, this intellectual field...

Publication date: 2009